@Shu-Wan
Member

@Shu-Wan Shu-Wan commented Feb 6, 2026

Summary

This PR adds two classic regression datasets from sklearn to CausalBench for demo and testing purposes.

Datasets Added

1. California Housing Dataset

  • Samples: 20,640
  • Columns: 9 (8 features: MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude; target: MedHouseVal)
  • Task: Regression - predict median house values in California districts
  • Source: sklearn.datasets.fetch_california_housing

2. Diabetes Dataset

  • Samples: 442
  • Columns: 11 (10 features: age, sex, bmi, bp, s1-s6; target: target)
  • Task: Regression - predict disease progression from physiological variables
  • Source: sklearn.datasets.load_diabetes

Changes

Dataset Files

Each dataset includes:

  • CSV data file with all features and target
  • config.yaml configuration file following CausalBench schema
  • download_data.py script to regenerate data from sklearn
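
A regeneration script along these lines could be sketched as follows. This is a minimal sketch, not the download_data.py shipped in this PR: `export_frame` and the output path are hypothetical names, and it assumes the loader accepts `as_frame=True` and returns an object with a `.frame` attribute, as `sklearn.datasets.load_diabetes` and `sklearn.datasets.fetch_california_housing` do.

```python
from pathlib import Path


def export_frame(loader, csv_path):
    """Call an sklearn-style loader and write features plus target to one CSV.

    `loader` is any callable that accepts `as_frame=True` and returns an object
    whose `.frame` attribute is DataFrame-like (has `to_csv` and `shape`).
    `export_frame` is a hypothetical helper, not part of CausalBench.
    """
    bunch = loader(as_frame=True)
    frame = bunch.frame  # feature columns plus the target column
    csv_path = Path(csv_path)
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    frame.to_csv(csv_path, index=False)  # keep the row index out of the CSV
    return frame.shape
```

With scikit-learn installed, calling `export_frame(load_diabetes, "diabetes/diabetes_data.csv")` after `from sklearn.datasets import load_diabetes` should return `(442, 11)`: 442 samples, 10 features and the target column.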

Deliverables

  • ✅ Two regression datasets in causalbench-asu/tests/data/
  • ✅ Compressed .zip files for each dataset
  • ✅ Updated README.md with dataset information
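
For reference, packaging a dataset folder into a .zip can be sketched with the standard library alone. This is an illustration under stated assumptions, not the contents of zip_files.py: `zip_dataset` is a hypothetical helper, and storing files at the archive root (rather than inside a top-level folder) is an assumed layout.

```python
import zipfile
from pathlib import Path


def zip_dataset(dataset_dir):
    """Pack every file under `dataset_dir` into `<dataset_dir>.zip` beside it.

    Hypothetical helper; the archive layout is an assumption.
    """
    dataset_dir = Path(dataset_dir)
    zip_path = dataset_dir.parent / (dataset_dir.name + ".zip")
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(dataset_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to the dataset folder, e.g. "config.yaml".
                zf.write(path, arcname=path.relative_to(dataset_dir).as_posix())
    return zip_path
```

Writing the archive next to the folder rather than inside it keeps the zip from being picked up by the directory walk on a second run.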

Design Decisions

  • No causal adjacency matrices: These are only required for causal discovery tasks, not regression tasks
  • Classic sklearn datasets: Well-defined, documented, appropriate size for demos
  • Standalone regression tasks: Configured explicitly as regression tasks in descriptions

Testing

All datasets successfully load through the CausalBench framework:

✅ California Housing Dataset: PASSED
✅ Diabetes Dataset: PASSED

Total: 2/2 tests passed

@kapkic
Collaborator

kapkic commented Feb 6, 2026

LGTM

@Shu-Wan Shu-Wan marked this pull request as ready for review February 6, 2026 18:15
Copilot AI review requested due to automatic review settings February 6, 2026 18:15
Copilot AI left a comment

Pull request overview

This PR adds two sklearn-sourced regression datasets (California Housing and Diabetes) to the CausalBench test data bundle for demos/testing, along with lightweight download scripts and README documentation updates.

Changes:

  • Added california_housing dataset config + regeneration script (and accompanying data/zip artifacts).
  • Added diabetes dataset config + CSV + regeneration script (and accompanying zip artifact).
  • Updated README dataset table; minor formatting cleanup in zip_files.py.

Reviewed changes

Copilot reviewed 7 out of 10 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| causalbench-asu/tests/zip_files.py | Minor formatting / quoting updates for zip utility. |
| causalbench-asu/tests/data/diabetes/download_data.py | Script to regenerate the Diabetes CSV from sklearn. |
| causalbench-asu/tests/data/diabetes/diabetes_data.csv | Added Diabetes dataset CSV. |
| causalbench-asu/tests/data/diabetes/config.yaml | Added dataset config for Diabetes. |
| causalbench-asu/tests/data/diabetes.zip | Added packaged dataset zip. |
| causalbench-asu/tests/data/california_housing/download_data.py | Script to regenerate the California Housing CSV from sklearn. |
| causalbench-asu/tests/data/california_housing/config.yaml | Added dataset config for California Housing. |
| README.md | Updated dataset list table to include the new datasets + minor whitespace cleanup. |

Comment on lines 16 to 56
  type: continuous
  data: decimal
sex:
  header: sex
  type: continuous
  data: decimal
bmi:
  header: bmi
  type: continuous
  data: decimal
bp:
  header: bp
  type: continuous
  data: decimal
s1:
  header: s1
  type: continuous
  data: decimal
s2:
  header: s2
  type: continuous
  data: decimal
s3:
  header: s3
  type: continuous
  data: decimal
s4:
  header: s4
  type: continuous
  data: decimal
s5:
  header: s5
  type: continuous
  data: decimal
s6:
  header: s6
  type: continuous
  data: decimal
target:
  header: target
  type: continuous
Copilot AI Feb 6, 2026

The column type values are set to continuous, but other dataset configs in this repo use values like ratio/nominal (e.g., tests/data/panama/config.yaml). If the dataset loader only recognizes the existing enum values, continuous will fail schema validation or parsing. Please align these column type fields with the accepted values used elsewhere (e.g., use ratio for numeric continuous variables) or update the loader/schema to explicitly support continuous.

Suggested change
  type: ratio
  data: decimal
sex:
  header: sex
  type: ratio
  data: decimal
bmi:
  header: bmi
  type: ratio
  data: decimal
bp:
  header: bp
  type: ratio
  data: decimal
s1:
  header: s1
  type: ratio
  data: decimal
s2:
  header: s2
  type: ratio
  data: decimal
s3:
  header: s3
  type: ratio
  data: decimal
s4:
  header: s4
  type: ratio
  data: decimal
s5:
  header: s5
  type: ratio
  data: decimal
s6:
  header: s6
  type: ratio
  data: decimal
target:
  header: target
  type: ratio

Collaborator

Actually, this is important. @prat-man @Shu-Wan I don't recall that we have any "type" field for datasets. We may need to remove these entries. Please confirm.

Member Author

type is not used at the moment; see:

for column, col_data in data.columns.items():
    if data.headers:
        col_df = file_df[col_data.header]
    else:
        col_df = file_df[col_data._index]
    if col_data.data == 'integer':
        if not pd.api.types.is_integer_dtype(col_df):
            raise TypeError(f'Data type mismatch for column {column}')
        if 'labels' in col_data:
            labels = sorted(col_data.labels)
            data_labels = sorted(file_df[col_data.header].unique())
            if labels != data_labels:
                raise ValueError(f'Labels do not match for column {column}')
        if 'range' in col_data:
            start = col_data.range.start
            end = col_data.range.end
            min1 = min(file_df[col_data.header])
            max1 = max(file_df[col_data.header])
            if not (start <= min1 <= end and start <= max1 <= end):
                raise ValueError(f'Range does not match for column {column}')
    elif col_data.data == 'decimal':
        if not pd.api.types.is_float_dtype(col_df):
            raise TypeError(f'Data type mismatch for column {column}')
        if 'labels' in col_data:
            labels = sorted(col_data.labels)
            data_labels = sorted(file_df[col_data.header].unique())
            if labels != data_labels:
                raise ValueError(f'Labels do not match for column {column}')
        if 'range' in col_data:
            start = col_data.range.start
            end = col_data.range.end
            min1 = min(file_df[col_data.header])
            max1 = max(file_df[col_data.header])
            if not (start <= min1 <= end and start <= max1 <= end):
                raise ValueError(f'Range does not match for column {column}')

existing configs either leave it blank or set it as ratio or nominal

based on schema, it can be defined quite arbitrarily (

type:
  anyOf:
    - type: string
    - type: 'null'
)

I changed it to blank.
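
To make the schema point concrete, here is a minimal sketch of what `anyOf: [string, null]` accepts once the YAML has been parsed into Python values. `column_type_is_valid` is a hypothetical helper written for illustration, not CausalBench code; it assumes a blank `type:` entry is parsed as `None`, which is how YAML loaders treat an empty value.

```python
def column_type_is_valid(value):
    """Mirror `anyOf: [string, null]`: accept any str, or None for a blank entry."""
    return value is None or isinstance(value, str)


# Every candidate discussed in this thread satisfies the schema,
# so the schema alone does not distinguish between them.
candidates = ["continuous", "ratio", "nominal", None]
results = {repr(value): column_type_is_valid(value) for value in candidates}
print(results)
```

Under this reading, continuous, ratio, nominal and a blank entry are all schema-valid; only a non-string value such as a number would be rejected.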

@kapkic @prat-man

update zip_files
update docs